Fermi-Dirac statistics, Bose-Einstein statistics, Maxwell-Boltzmann distribution, Planck's law
Binomial distribution
Exponential distribution
Gumbel distribution
Logistic distribution
Log normal distribution
Normal distribution
Pareto distribution
Poisson distribution
Student's t-distribution
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import plotly.graph_objects as go
import pandas as pd
import seaborn as sns
from copy import deepcopy
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from scipy.stats import gumbel_r, gompertz, maxwell, pareto, levy, lognorm, weibull_max, expon, poisson
matplotlib.rcParams["figure.figsize"] = (10, 6)
matplotlib.rcParams["axes.titlesize"] = 16
matplotlib.rcParams["axes.labelsize"] = 13
matplotlib.rcParams["xtick.labelsize"] = 16
matplotlib.rcParams["ytick.labelsize"] = 16
matplotlib.rcParams["axes.grid"] = True
matplotlib.rcParams["xtick.major.size"] = 12
matplotlib.rcParams["ytick.major.size"] = 12
matplotlib.rcParams["xtick.major.width"] = 2
matplotlib.rcParams["ytick.major.width"] = 2
For the purposes of the current exercises, select a stream from a rainy area with relatively small discharge, so that the effect of short but strong storms is visible. Good choices are small rivers in the north-eastern US, e.g. site 01589440. Retrieve at least 10 years of data.
Large amounts of historical surface water data are available from the United States Geological Survey (USGS) at https://waterdata.usgs.gov/nwis. The goal of the project is to retrieve samples from the web interface manually, and then automate the process by calling the web service as described at https://help.waterdata.usgs.gov/faq/automated-retrievals.
The dataset in the /home/course/Datasets/02-data/ folder can also be used.
Find a database of historic water level measurements of the Danube at Budapest.
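The automated-retrieval route mentioned above can start from a programmatically built query string rather than a hand-edited URL; a minimal sketch (the site number, dates, and parameter codes below are placeholders echoing the manual query):

```python
from urllib.parse import urlencode

# Build the NWIS instantaneous-values query string programmatically.
# Site number and date range are placeholders, not part of the exercise data.
base = "https://nwis.waterdata.usgs.gov/nwis/uv/"
params = {
    "parameterCd": "00060,00065",  # discharge and gage height codes
    "format": "rdb",               # tab-separated output, easy to parse
    "site_no": "01589440",
    "begin_date": "2010-01-01",
    "end_date": "2019-12-31",
    "siteStatus": "all",
}
url = base + "?" + urlencode(params)
```

The `rdb` format avoids HTML parsing entirely; the same parameter names appear in the `curl` command used later.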
my_url = "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_00060=on&cb_00065=on&format=html&site_no=03014500&period=&begin_date=2010-02-08&end_date=2021-02-15"
#Downloading the webpage and storing its HTML right away.
resp_obj = urlopen(Request(my_url, headers={'User-Agent': 'Mozilla/6.0'}))
page_html = resp_obj.read()
resp_obj.close()
#Parsing the website's HTML with bs4. For some reason, I couldn't reach the body.table part in the usual way,
#so I went with a brute-force method instead. Of course, if the table had consisted of more columns with
#other types of data, this code would not have worked right away.
parsed_text = BeautifulSoup(page_html,"html.parser")
rows = [x for x in parsed_text.body.contents[11].contents[4].children]
rows = list(filter(lambda x: x != "\n", rows)) #filtering out newline characters
data = {} #Storing every row of the table in a key-value pair (dict)
for i, row in enumerate(rows):
    cells = row.findAll("td", {"nowrap": "nowrap"})  #the three data cells of the row
    datetime_timezone = cells[0].string
    discharge_value = cells[1].span.string
    discharge_comment = cells[1].sup.string
    gage_height_value = cells[2].span.string
    gage_height_comment = cells[2].sup.string
    data[f"row_{i}"] = [datetime_timezone, discharge_value, discharge_comment, gage_height_value, gage_height_comment]
data_df = pd.DataFrame.from_dict(data, orient="index", columns = ["datetime_tz", "discharge_v", "discharge_c", "gage_h", "gage_c"])
data_df.head()
!curl "https://nwis.waterdata.usgs.gov/nwis/uv/?parameterCd=00060,00065&format=rdb&site_no=01321000&period=&begin_date=20100101&end_date=20191231&siteStatus=all" > data.csv
!head -n 40 data.csv
Load the downloaded data file into the processing environment, paying attention to handling time stamps and performing the necessary data type conversions. Converting dates to floating point numbers such as a Unix timestamp or Julian date usually makes handling time series easier. Plot the data for a certain interval to show that the effect of storms is clearly visible.
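As a toy illustration of the conversions mentioned above (separate from the processing pipeline below), pandas datetimes can be turned into Unix timestamps or Julian dates directly; the two sample timestamps are arbitrary:

```python
import pandas as pd

# Two sample timestamps, 15 minutes apart
ts = pd.to_datetime(["2010-02-08 00:00", "2010-02-08 00:15"])

# Unix timestamp: whole seconds elapsed since 1970-01-01 00:00 UTC
unix_s = (ts - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")

# Julian date: days (with fraction) since the astronomical epoch
julian = ts.to_julian_date()
```

Both are monotone floating-point (or integer) encodings of time, so differences and comparisons become plain arithmetic.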
df = pd.read_csv("data.csv",sep='\t', comment='#',header=0)
df = df.drop(index=0)
df.head()
print(df.agency_cd.unique(), df.site_no.unique(), df.tz_cd.unique())
#These columns are irrelevant, because each contains only a single repeated value
df = df.drop(columns = ["agency_cd", "site_no","tz_cd"])
df.head()
df.dtypes #The columns have to be cast to the appropriate data types
From the header of the .csv data file we know what the column codes mean:
106704 00065 Gage height, feet
106705 00060 Discharge, cubic feet per second
df = df.rename(columns = {"106704_00065": "gage", "106704_00065_cd": "gage_c", "106705_00060": "discharge", "106705_00060_cd":"discharge_c"})
df["datetime"] = pd.to_datetime(df.datetime)
df["discharge"] = df.discharge.astype(float)
df["gage"] = df.gage.astype(float)
df.head(), df.dtypes
#There are some NaN values; the ratio of NaNs in each column is the following:
df.isna().sum()/len(df)
df = df[df.discharge.notna()] #Dropping the rows where discharge is missing
df.isna().sum()/len(df)
Plot the histogram of water discharge values. Fit the data with an appropriate distribution function and bring arguments in favor of the choice of function.
def fitPlotDist(dist, to_fit, x, alpha=1.0):
    #Fit the given scipy distribution to the data and plot its PDF over x
    params = dist.fit(to_fit)
    rv = dist(*params)
    plt.plot(x, rv.pdf(x), label=dist.name, alpha=alpha)
sns.distplot(df.discharge,bins=2000, label="raw data with KDE", kde = True)
plt.ylabel("frequency")
x = np.linspace(min(df.discharge), max(df.discharge), 1000)
#fitPlotDist(gumbel_r,df.discharge,x)
#fitPlotDist(pareto, df.discharge, x)
#fitPlotDist(maxwell, df.discharge,x)
fitPlotDist(lognorm,df.discharge,x)
fitPlotDist(levy, df.discharge, x)
fitPlotDist(expon,df.discharge,x)
plt.xlim(0,5000)
plt.legend()
plt.title("Histogram of water discharge values")
The plot above shows that, of the three distributions fitted, the Lévy distribution matches the data best. However, I did not find anything about the Lévy distribution on Wikipedia that would support this choice.
I suspect that the log-normal distribution would be better founded, because, as Wikipedia states:
a log-normal process is the statistical realization of the multiplicative product of many independent random variables.
This is true for discharge values: they are affected by many variables, be they meteorological or geological.
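The multiplicative argument can be checked with a quick simulation: the log of a product of many independent positive factors is a sum, so the central limit theorem makes the product approximately log-normal (the factor distribution and sample sizes below are arbitrary):

```python
import numpy as np
from scipy.stats import skew

# Product of 50 independent positive random factors, repeated 100k times
rng = np.random.default_rng(42)
factors = rng.uniform(0.5, 1.5, size=(100_000, 50))
product = factors.prod(axis=1)

# If the product is ~lognormal, its log should be ~Gaussian (skewness ~ 0)
s = skew(np.log(product))
```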
In case of small streams, storms and passing weather fronts with rain can significantly increase water discharge on a short time scale. Develop a simple algorithm to detect rainy events in the data. Plot the original time series and mark rainy events.
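Before applying it to the real series, the detection idea can be sketched on synthetic data: `scipy.signal.find_peaks` flags local maxima, and its `height` and `prominence` arguments suppress ordinary fluctuations so that only storm-like spikes remain (all numbers here are arbitrary):

```python
import numpy as np
from scipy.signal import find_peaks

# A slowly varying "dry weather" flow with three artificial storm spikes
t = np.arange(200)
signal = 100 + 5 * np.sin(t / 10)
signal[[50, 120, 170]] += [400, 250, 600]

# height filters out the baseline, prominence filters out small wiggles
peaks, props = find_peaks(signal, height=200, prominence=100)
```

The same two thresholds are the knobs tuned on the real discharge series below.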
fig, ax1 = plt.subplots(figsize=[15,5])
ax1.plot(df.datetime,df.discharge, color="tab:orange", label="Discharge")
ax1.set_ylabel("Discharge [$ft^3$/s]",size=13)
plt.grid(ls = "dashed", color="tab:orange")
plt.legend(loc = 2,fontsize=12)
ax2 = ax1.twinx()
ax2.plot(df.datetime, df.gage, color="tab:blue",alpha = 0.6, label="Gage height")
ax2.set_ylabel("Gage height [ft]", size=13)
plt.grid(ls = "dotted",alpha = 0.5,color="tab:blue")
plt.legend(loc = 1, fontsize=12)
plt.title("Discharge and gage height plotted against time")
from scipy.stats import pearsonr
pearsonr(df[df.isna().sum(axis=1) < 1].discharge, df[df.isna().sum(axis=1) < 1].gage)
From this graph we can conclude that the discharge at a given cross section is linearly related to the gage height, which seems intuitive. This claim is also supported by the strong positive correlation coefficient.
'''fig, ax1= plt.subplots(figsize=[15,5])
ax1.plot(df.datetime,df.discharge.diff(), color="tab:orange", label="Change in discharge")
ax1.set_ylabel("Change in discharge [$ft^3$/s]",size=13)
plt.grid(ls = "dashed", color="tab:orange")
plt.legend(loc = 2)
ax2 = ax1.twinx()
ax2.plot(df.datetime, df.gage.diff(), color="tab:blue",alpha = 0.6, label="Change in gage height")
ax2.set_ylabel("Change in gage height [ft]", size=13)
plt.grid(ls = "dotted",alpha = 0.5,color="tab:blue")
plt.legend(loc = 1)
plt.title("Change in discharge and in gage height plotted against time",size=16)'''
from scipy.signal import argrelextrema, find_peaks
plt.figure(figsize=[15,5])
plt.plot(df.datetime,df.discharge, color="tab:orange", label="Discharge")
#p = argrelextrema(df.discharge.to_numpy(), np.greater, order = 5)[0]
p,_ = find_peaks(df.discharge,threshold=None, height = 2000, prominence = 700)
plt.plot(df.datetime.to_numpy()[p],df.discharge.to_numpy()[p], "o", label="Possible rain",color="tab:green")
plt.ylabel("Discharge [$ft^3$ /s ]",size=13)
plt.legend()
#plt.ylim(0,3000)
plt.title("Possible rains based on sudden increase in discharge",size=16)
len(p)
fig = go.Figure()
fig.add_trace(go.Scatter(x=df.datetime,
y=df.discharge,
name = "Discharge",
line={ "color" : "#FF7F0E", "width": 2}))
p,_ = find_peaks(df.discharge,threshold=None, height = 500, prominence = 500)
#Plot only the points, where peaks were found
fig.add_trace(go.Scatter(x = pd.to_datetime(df.datetime.to_numpy()[p]),
y = df.discharge.to_numpy()[p],
mode='markers',
name='Possible rain',
marker = { "color" : "#2CA02C"}))
fig.update_layout(yaxis_title="Discharge [ft^3 /s ]")
#fig.update_xaxes(range=[pd.to_datetime("2014-01-01 00:00:00"), pd.to_datetime("2015-12-31 23:45:00")])
#This was my test interval
fig.show()
len(p)
Water discharge increases significantly during rain producing maxima in the time series. Plot the distribution of maximum values and fit with an appropriate function. Bring arguments to support the choice of probabilistic model.
plt.figure(figsize=[8,6])
sns.distplot(df.discharge.diff().to_numpy()[p],label="raw data with KDE", bins = 20, hist_kws = {"range" : [0,1000]})
plt.xlabel("Discharge [$ft^3$/s]",size=13)
fitPlotDist(gumbel_r,df.discharge.diff().to_numpy()[p],np.linspace(0,3000,1000))
fitPlotDist(pareto,df.discharge.diff().to_numpy()[p],np.linspace(0,3000,1000))
plt.ylim(0,0.02)
plt.xlim(0,1000)
plt.legend()
We can see that the Gumbel distribution fits the tail of the data well. As the Wikipedia article states:
This distribution might be used to represent the distribution of the maximum level of a river in a particular year if there was a list of maximum values for the past ten years.
Although the article refers to water levels (the gage height feature), this holds for the discharge as well, because of the strong positive correlation between the two features.
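The extreme-value argument can be illustrated with a simulation: by the Fisher-Tippett-Gnedenko theorem, block maxima of light-tailed samples (exponential here, with an arbitrary block size of 365) approach a Gumbel distribution:

```python
import numpy as np
from scipy.stats import gumbel_r

# 10k "years" of 365 light-tailed daily values each; keep each year's maximum
rng = np.random.default_rng(0)
samples = rng.exponential(scale=1.0, size=(10_000, 365))
block_max = samples.max(axis=1)

# For exp(1) data the limiting Gumbel has loc ~ ln(365) ~ 5.9 and scale ~ 1
loc, scale = gumbel_r.fit(block_max)
```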
Once rainy events are detected, plot the distribution of the length of sunny intervals between rains. Fit the distribution with an appropriate function.
t_rain = pd.Series(df.datetime.to_numpy()[p]).diff()
t_rain_hours = t_rain.dt.total_seconds() / 3600 #gap lengths in hours
sns.distplot(t_rain_hours, bins = 100, label="raw data with KDE")
fitPlotDist(pareto, t_rain_hours[ t_rain_hours.notna()], np.linspace(0,2000,1000))
fitPlotDist(expon, t_rain_hours[ t_rain_hours.notna()], np.linspace(0,2000,1000))
fitPlotDist(gompertz, t_rain_hours[ t_rain_hours.notna()], np.linspace(0,2000,1000), alpha = 0.5)
plt.ylim(0,0.0065)
plt.legend()
plt.xlim(0,2000)
plt.xlabel("sunny hours between rains [h]")
We can see on the graph that the Gompertz and the exponential distributions fit the tail of the distribution quite well, while the first part is much better fit by the Pareto distribution.
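For reference, if rainy events arrived as a homogeneous Poisson process, the dry gaps between them would be exactly exponential; a quick simulated check (the mean gap is chosen arbitrarily) uses the fingerprint that an exponential distribution has equal mean and standard deviation:

```python
import numpy as np

# Simulated dry gaps from a homogeneous Poisson process, mean gap 100 "hours"
rng = np.random.default_rng(1)
gaps = rng.exponential(scale=100.0, size=50_000)

# Exponential fingerprint: standard deviation equals the mean
mean_gap = gaps.mean()
std_gap = gaps.std()
```

A Pareto-like excess of short gaps, as seen above, hints that real rainy events cluster rather than arrive at a constant rate.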
What is the maximum of water discharge in an arbitrarily chosen period of one year? Calculate the maximum of water discharge due to rain in a rolling window of 1 year, plot its distribution and fit with an appropriate function.
df3 = deepcopy(df)
df3 = df3.set_index("datetime") #modifying index to the dates, so that the rolling window works
sns.displot(df3.rolling("365D").max().discharge, bins = 30, kde=False, stat="density" )
fitPlotDist(gumbel_r,df3.rolling("365D").max().discharge, np.linspace(0,4*10**4, 1000))
fitPlotDist(lognorm,df3.rolling("365D").max().discharge, np.linspace(0,4*10**4, 1000))
plt.ylabel("density")
plt.legend()
This distribution can also be fitted with the Gumbel distribution; as discussed at Exercise 5, it is likewise a distribution of extreme values.
How many times does it rain in a month? (Again, use a rolling window function.) Calculate and plot the distribution and fit it with an appropriate function.
from scipy.optimize import curve_fit
max_rain = 20
df3["isRainy"] = np.isin(np.arange(len(df3)), p).astype(int) #Encode the samples flagged as rainy
hist = plt.hist(df3.rolling("30D").isRainy.sum(),bins = np.arange(max_rain)-0.5,
label="raw data", density = True)
#Computing the midpoint of each bin
bin_mids = [ (hist[1][i] + hist[1][i+1]) *0.5 for i in range( len(hist[1]) -1 )]
#Fitting the Poisson distribution through its probability mass function
def fit_func(k, lambd):
    return poisson.pmf(k, lambd)
params, cov_matr = curve_fit(fit_func, bin_mids, hist[0])
#Plotting
plt.vlines(bin_mids, 0, fit_func(bin_mids, *params), label=rf"Poisson distribution $\lambda=${np.round(params[0], 3)}")
plt.plot(bin_mids, fit_func(bin_mids, *params), color="black", marker="o", ls="")
plt.ylabel("density")
plt.legend()
plt.grid()
The Poisson distribution expresses the probability of a given number of events occurring in a fixed interval of time or space, if these events occur with a known constant mean rate and independently of the time since the last event. Note, however, that the number of rainy events likely varies with the seasons, which weakens the constant-rate assumption here.
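A simple diagnostic for the constant-rate assumption is the variance-to-mean ratio of the window counts: Poisson counts have ratio 1, while a seasonally varying rate inflates it. A sketch on simulated counts (the rate of ~4 rains per month is an arbitrary stand-in):

```python
import numpy as np

# Counts of independent constant-rate events in equal windows are Poisson
rng = np.random.default_rng(2)
counts = rng.poisson(lam=4.0, size=20_000)

# Dispersion index: equals 1 for a Poisson distribution
ratio = counts.var() / counts.mean()
```

Applying the same ratio to the real rolling monthly counts would quantify how far the data departs from the Poisson model.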
Find the measuring station you used in the exercises above on the map. Find another measuring station about 100-200 km from it and download its data. Try to estimate the typical time it takes for weather fronts to travel the distance between the two stations.
!curl "https://nwis.waterdata.usgs.gov/nwis/uv/?parameterCd=00060,00065&format=rdb&site_no=01510000&period=&begin_date=20100101&end_date=20191231&siteStatus=all" > data2.csv
!head -n 40 data2.csv
df2 = pd.read_csv("data2.csv",sep='\t', comment='#',header=0)
df2 = df2.drop(index=0)
df2.head()
print(df2.agency_cd.unique(), df2.site_no.unique(), df2.tz_cd.unique())
df2 = df2.drop(columns = ["agency_cd", "site_no","tz_cd"])
df2.head()
df2.dtypes
df2 = df2.rename(columns = {"107451_00065": "gage2", "107451_00065_cd": "gage_c2", "107450_00060": "discharge2", "107450_00060_cd":"discharge_c2"})
df2["datetime"] = pd.to_datetime(df2.datetime)
df2["discharge2"] = df2.discharge2.astype(float)
df2["gage2"] = df2.gage2.astype(float)
df2.head(), df2.dtypes
print(df2.isna().sum()/len(df2))
#Let's get rid of the missing data
df2 = df2[df2.discharge2.notna()]
print("\n",df2.isna().sum()/len(df2))
plt.figure(figsize=[15,5])
plt.plot(df2.datetime,df2.discharge2, color="tab:orange", label="Discharge")
p2 = argrelextrema(df2.discharge2.to_numpy(), np.greater, order = 650)[0]
plt.plot(df2.datetime.to_numpy()[p2],df2.discharge2.to_numpy()[p2], "o", label="possible rain",color="tab:green")
plt.ylabel("Discharge [$ft^3$ /s ]",size=13)
plt.legend()
plt.title("The second measuring station's discharge values",size=16)
fig = go.Figure()
fig.add_trace(go.Scatter(x=df2.datetime,
y=df2.discharge2,
name = "Discharge",
line={ "color" : "#FF7F0E", "width": 2}))
p2,_ = find_peaks(df2.discharge2,threshold=None, height = 300, prominence = 300)
#Plot only the points, where peaks were found
fig.add_trace(go.Scatter(x = pd.to_datetime(df2.datetime.to_numpy()[p2]),
y = df2.discharge2.to_numpy()[p2],
mode='markers',
name='Possible rain',
marker = { "color" : "#2CA02C"}))
fig.update_layout(yaxis_title="Discharge [ft^3 /s]")
fig.update_layout(title="The second measuring station's discharge values")
#fig.update_xaxes(range=[pd.to_datetime("2014-01-01 00:00:00"), pd.to_datetime("2015-12-31 23:45:00")])
#This was my test interval
fig.show()
len(p2)
Cross-correlation can identify the similarity of two signals. By looking at the result, the correlogram, the degree of similarity can be judged qualitatively: from its shape and symmetry we can decide whether the two signals are similar, and quantitatively measure the lag between their similar parts.
I used this method to estimate the lag of rainy events between the two measuring stations.
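The lag-estimation idea can be verified on synthetic data: shift a signal by a known amount, cross-correlate, and recover the shift from the position of the correlation maximum (the signal length and lag below are arbitrary):

```python
import numpy as np
from scipy.signal import correlate

# A white-noise "upstream" signal and a copy delayed by a known lag
rng = np.random.default_rng(3)
n, true_lag = 1000, 37
upstream = rng.normal(size=n)
downstream = np.roll(upstream, true_lag)   # delayed (circularly) by 37 samples

# Full cross-correlation of two length-n signals; zero lag sits at index n - 1
cc = correlate(downstream, upstream, mode="full")
est_lag = np.argmax(cc) - (n - 1)
```

The `argmax(cc) - (n - 1)` conversion is the step that turns the raw argmax index into a lag in samples.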
len(df), len(df2) #We can see that the two dataframes contain different numbers of rows.
#This can be attributed to missing data
#Joining the tables on the common dates
df_merged = pd.merge(df, df2, how='inner', left_on='datetime', right_on='datetime')
df_merged.head(), df_merged.shape
#Using scipy's cross-correlation
from scipy.signal import correlate
sc_cc = correlate(df_merged.discharge/max(df_merged.discharge), df_merged.discharge2/max(df_merged.discharge2),
mode="full")
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(sc_cc.size) - (len(df_merged) - 1),
y=sc_cc,
name = "Discharge",
line={ "color" : "#FF7F0E", "width": 2}))
print(np.argmax(sc_cc) - (len(df_merged) - 1)) #converting the argmax index to a lag in samples
fig.update_layout(yaxis_title="Cross-correlation",
xaxis_title="Lag")
fig.show()